
<!DOCTYPE html>

<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />

    <title>Chapter 2: pandas Basics &#8212; Joyful Pandas 1.0 documentation</title>
<script>
  document.documentElement.dataset.mode = localStorage.getItem("mode") || "";
  document.documentElement.dataset.theme = localStorage.getItem("theme") || "light"
</script>

  <!-- Loaded before other Sphinx assets -->
  <link href="../_static/styles/theme.css?digest=92025949c220c2e29695" rel="stylesheet">
<link href="../_static/styles/pydata-sphinx-theme.css?digest=92025949c220c2e29695" rel="stylesheet">


  <link rel="stylesheet"
    href="../_static/vendor/fontawesome/5.13.0/css/all.min.css">
  <link rel="preload" as="font" type="font/woff2" crossorigin
    href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.woff2">
  <link rel="preload" as="font" type="font/woff2" crossorigin
    href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.woff2">

    <link rel="stylesheet" type="text/css" href="../_static/pygments.css" />
    <link rel="stylesheet" type="text/css" href="../_static/plot_directive.css" />
    <link rel="stylesheet" type="text/css" href="../_static/css/s4defs-roles.css" />

  <!-- Pre-loaded scripts that we'll load fully later -->
  <link rel="preload" as="script" href="../_static/scripts/pydata-sphinx-theme.js?digest=92025949c220c2e29695">

    <script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
    <script src="../_static/jquery.js"></script>
    <script src="../_static/underscore.js"></script>
    <script src="../_static/_sphinx_javascript_frameworks_compat.js"></script>
    <script src="../_static/doctools.js"></script>
    <script async="async" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
    <link rel="index" title="Index" href="../genindex.html" />
    <link rel="search" title="Search" href="../search.html" />
    <link rel="next" title="Chapter 3: Indexing" href="ch3.html" />
    <link rel="prev" title="Chapter 1: Preliminaries" href="ch1.html" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="docsearch:language" content="en">
  </head>
  
  
  <body data-spy="scroll" data-target="#bd-toc-nav" data-offset="180" data-default-mode="">
    <div class="bd-header-announcement container-fluid" id="banner">
      

    </div>

    
    <nav class="bd-header navbar navbar-light navbar-expand-lg bg-light fixed-top bd-navbar" id="navbar-main"><div class="bd-header__inner container-xl">

  <div id="navbar-start">
    
    
  


<a class="navbar-brand logo" href="../index.html">
  
  
  
  
    <img src="../_static/finallogo1.svg" class="logo__image only-light" alt="Logo image">
    <img src="../_static/finallogo1.svg" class="logo__image only-dark" alt="Logo image">
  
  
</a>
    
  </div>

  <button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbar-collapsible" aria-controls="navbar-collapsible" aria-expanded="false" aria-label="Toggle navigation">
    <span class="fas fa-bars"></span>
  </button>

  
  <div id="navbar-collapsible" class="col-lg-9 collapse navbar-collapse">
    <div id="navbar-center" class="mr-auto">
      
      <div class="navbar-center-item">
        <ul id="navbar-main-elements" class="navbar-nav">
    <li class="toctree-l1 nav-item">
 <a class="reference internal nav-link" href="../Home.html">
  Home
 </a>
</li>

<li class="toctree-l1 current active nav-item">
 <a class="reference internal nav-link" href="index.html">
  Content
 </a>
</li>

<li class="toctree-l1 nav-item">
 <a class="reference internal nav-link" href="../Author.html">
  Author
 </a>
</li>

<li class="toctree-l1 nav-item">
 <a class="reference internal nav-link" href="../Datawhale.html">
  Datawhale
 </a>
</li>

<li class="toctree-l1 nav-item">
 <a class="reference internal nav-link" href="../pandas%E6%95%B0%E6%8D%AE%E5%A4%84%E7%90%86%E4%B8%8E%E5%88%86%E6%9E%90.html">
  pandas Data Processing and Analysis
 </a>
</li>

<li class="toctree-l1 nav-item">
 <a class="reference internal nav-link" href="../%E8%A1%A5%E5%85%85%E4%B9%A0%E9%A2%98.html">
  Supplementary Exercises
 </a>
</li>

    
    <li class="nav-item">
        <a class="nav-link nav-external" href="https://pandas.pydata.org/docs/index.html">Doc<i class="fas fa-external-link-alt"></i></a>
    </li>
    
</ul>
      </div>
      
    </div>

    <div id="navbar-end">
      
      <div class="navbar-end-item">
        <span id="theme-switch" class="btn btn-sm btn-outline-primary navbar-btn rounded-circle">
    <a class="theme-switch" data-mode="light"><i class="fas fa-sun"></i></a>
    <a class="theme-switch" data-mode="dark"><i class="far fa-moon"></i></a>
    <a class="theme-switch" data-mode="auto"><i class="fas fa-adjust"></i></a>
</span>
      </div>
      
      <div class="navbar-end-item">
        <ul id="navbar-icon-links" class="navbar-nav" aria-label="Icon Links">
        <li class="nav-item">
          <a class="nav-link" href="https://github.com/datawhalechina/joyful-pandas" rel="noopener" target="_blank" title="GitHub"><span><i class="fab fa-github-square"></i></span>
            <label class="sr-only">GitHub</label></a>
        </li>
      </ul>
      </div>
      
    </div>
  </div>
</div>
    </nav>
    

    <div class="bd-container container-xl">
      <div class="bd-container__inner row">
          

<!-- Only show if we have sidebars configured, else just a small margin  -->
<div class="bd-sidebar-primary col-12 col-md-3 bd-sidebar">
  <div class="sidebar-start-items"><form class="bd-search d-flex align-items-center" action="../search.html" method="get">
  <i class="icon fas fa-search"></i>
  <input type="search" class="form-control" name="q" id="search-input" placeholder="Search the docs ..." aria-label="Search the docs ..." autocomplete="off" >
</form><nav class="bd-links" id="bd-docs-nav" aria-label="Main navigation">
  <div class="bd-toc-item active">
    <ul class="current nav bd-sidenav">
 <li class="toctree-l1">
  <a class="reference internal" href="ch1.html">
   Chapter 1: Preliminaries
  </a>
 </li>
 <li class="toctree-l1 current active">
  <a class="current reference internal" href="#">
   Chapter 2: pandas Basics
  </a>
 </li>
 <li class="toctree-l1">
  <a class="reference internal" href="ch3.html">
   Chapter 3: Indexing
  </a>
 </li>
 <li class="toctree-l1">
  <a class="reference internal" href="ch4.html">
   Chapter 4: Grouping
  </a>
 </li>
 <li class="toctree-l1">
  <a class="reference internal" href="ch5.html">
   Chapter 5: Reshaping
  </a>
 </li>
 <li class="toctree-l1">
  <a class="reference internal" href="ch6.html">
   Chapter 6: Joining
  </a>
 </li>
 <li class="toctree-l1">
  <a class="reference internal" href="ch7.html">
   Chapter 7: Missing Data
  </a>
 </li>
 <li class="toctree-l1">
  <a class="reference internal" href="ch8.html">
   Chapter 8: Text Data
  </a>
 </li>
 <li class="toctree-l1">
  <a class="reference internal" href="ch9.html">
   Chapter 9: Categorical Data
  </a>
 </li>
 <li class="toctree-l1">
  <a class="reference internal" href="ch10.html">
   Chapter 10: Time Series Data
  </a>
 </li>
 <li class="toctree-l1">
  <a class="reference internal" href="%E5%8F%82%E8%80%83%E7%AD%94%E6%A1%88.html">
   Reference Answers
  </a>
 </li>
</ul>

  </div>
</nav>
  </div>
  <div class="sidebar-end-items">
  </div>
</div>


          


<div class="bd-sidebar-secondary d-none d-xl-block col-xl-2 bd-toc">
  
    
    <div class="toc-item">
      
<div class="tocsection onthispage mt-5 pt-1 pb-3">
    <i class="fas fa-list"></i> On this page
</div>

<nav id="bd-toc-nav">
    <ul class="visible nav section-nav flex-column">
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id1">
   I. Reading and Writing Files
  </a>
  <ul class="nav section-nav flex-column">
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id2">
     1. Reading Files
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id3">
     2. Writing Data
    </a>
   </li>
  </ul>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id4">
   II. Basic Data Structures
  </a>
  <ul class="nav section-nav flex-column">
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#series">
     1. Series
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#dataframe">
     2. DataFrame
    </a>
   </li>
  </ul>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id5">
   III. Common Basic Functions
  </a>
  <ul class="nav section-nav flex-column">
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id6">
     1. Summary Functions
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id7">
     2. Descriptive Statistics Functions
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id8">
     3. Unique-Value Functions
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id9">
     4. Replacement Functions
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id10">
     5. Sorting Functions
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#apply">
     6. The apply Method
    </a>
   </li>
  </ul>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id11">
   IV. Window Objects
  </a>
  <ul class="nav section-nav flex-column">
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id12">
     1. Rolling Window Objects
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#id13">
     2. Expanding Windows
    </a>
   </li>
  </ul>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#id14">
   V. Exercises
  </a>
  <ul class="nav section-nav flex-column">
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#ex1">
     Ex1: Pokémon Dataset
    </a>
   </li>
   <li class="toc-h3 nav-item toc-entry">
    <a class="reference internal nav-link" href="#ex2">
     Ex2: Exponentially Weighted Windows
    </a>
   </li>
  </ul>
 </li>
</ul>

</nav>
    </div>
    
    <div class="toc-item">
      
    </div>
    
  
</div>


          
          
          <div class="bd-content col-12 col-md-9 col-xl-7">
              
              <article class="bd-article" role="main">
                
  <section id="pandas">
<h1>Chapter 2: pandas Basics<a class="headerlink" href="#pandas" title="Permalink to this heading">#</a></h1>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [1]: </span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="gp">In [2]: </span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
</pre></div>
</div>
<p>Before you start, make sure your <code class="docutils literal notranslate"><span class="pre">pandas</span></code> version is no lower than the one shown below; otherwise, please upgrade. Also confirm that the three packages <code class="docutils literal notranslate"><span class="pre">xlrd,</span> <span class="pre">xlwt,</span> <span class="pre">openpyxl</span></code> are installed. With pandas 1.2.x, the xlrd version must not be higher than <code class="docutils literal notranslate"><span class="pre">2.0.0</span></code>. With pandas 1.3.x or later, a standard xlrd installation is sufficient.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [3]: </span><span class="n">pd</span><span class="o">.</span><span class="n">__version__</span>
<span class="gh">Out[3]: </span><span class="go">&#39;1.2.0&#39;</span>
</pre></div>
</div>
<section id="id1">
<h2>I. Reading and Writing Files<a class="headerlink" href="#id1" title="Permalink to this heading">#</a></h2>
<section id="id2">
<h3>1. Reading Files<a class="headerlink" href="#id2" title="Permalink to this heading">#</a></h3>
<p><code class="docutils literal notranslate"><span class="pre">pandas</span></code> can read many file formats; this section focuses on reading <code class="docutils literal notranslate"><span class="pre">csv,</span> <span class="pre">excel,</span> <span class="pre">txt</span></code> files.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [4]: </span><span class="n">df_csv</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/my_csv.csv&#39;</span><span class="p">)</span>

<span class="gp">In [5]: </span><span class="n">df_csv</span>
<span class="gh">Out[5]: </span>
<span class="go">   col1 col2  col3    col4      col5</span>
<span class="go">0     2    a   1.4   apple  2020/1/1</span>
<span class="go">1     3    b   3.4  banana  2020/1/2</span>
<span class="go">2     6    c   2.5  orange  2020/1/5</span>
<span class="go">3     5    d   3.2   lemon  2020/1/7</span>

<span class="gp">In [6]: </span><span class="n">df_txt</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">&#39;data/my_table.txt&#39;</span><span class="p">)</span>

<span class="gp">In [7]: </span><span class="n">df_txt</span>
<span class="gh">Out[7]: </span>
<span class="go">   col1 col2  col3             col4</span>
<span class="go">0     2    a   1.4   apple 2020/1/1</span>
<span class="go">1     3    b   3.4  banana 2020/1/2</span>
<span class="go">2     6    c   2.5  orange 2020/1/5</span>
<span class="go">3     5    d   3.2   lemon 2020/1/7</span>

<span class="gp">In [8]: </span><span class="n">df_excel</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;data/my_excel.xlsx&#39;</span><span class="p">)</span>

<span class="gp">In [9]: </span><span class="n">df_excel</span>
<span class="gh">Out[9]: </span>
<span class="go">   col1 col2  col3    col4      col5</span>
<span class="go">0     2    a   1.4   apple  2020/1/1</span>
<span class="go">1     3    b   3.4  banana  2020/1/2</span>
<span class="go">2     6    c   2.5  orange  2020/1/5</span>
<span class="go">3     5    d   3.2   lemon  2020/1/7</span>
</pre></div>
</div>
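<p>The three functions share several commonly used parameters. <code class="docutils literal notranslate"><span class="pre">header=None</span></code> means the first row is not used as the column names. <code class="docutils literal notranslate"><span class="pre">index_col</span></code> sets one or more columns as the index (indexes are covered in detail in Chapter 3). <code class="docutils literal notranslate"><span class="pre">usecols</span></code> gives the set of columns to read; by default all columns are read. <code class="docutils literal notranslate"><span class="pre">parse_dates</span></code> lists the columns to convert to datetimes (time series are covered in Chapter 10). <code class="docutils literal notranslate"><span class="pre">nrows</span></code> sets the number of data rows to read. All of these parameters can be used with each of the three functions above.</p>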
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [10]: </span><span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">&#39;data/my_table.txt&#39;</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
<span class="gh">Out[10]: </span>
<span class="go">      0     1     2                3</span>
<span class="go">0  col1  col2  col3             col4</span>
<span class="go">1     2     a   1.4   apple 2020/1/1</span>
<span class="go">2     3     b   3.4  banana 2020/1/2</span>
<span class="go">3     6     c   2.5  orange 2020/1/5</span>
<span class="go">4     5     d   3.2   lemon 2020/1/7</span>

<span class="gp">In [11]: </span><span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/my_csv.csv&#39;</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;col1&#39;</span><span class="p">,</span> <span class="s1">&#39;col2&#39;</span><span class="p">])</span>
<span class="gh">Out[11]: </span>
<span class="go">           col3    col4      col5</span>
<span class="go">col1 col2                        </span>
<span class="go">2    a      1.4   apple  2020/1/1</span>
<span class="go">3    b      3.4  banana  2020/1/2</span>
<span class="go">6    c      2.5  orange  2020/1/5</span>
<span class="go">5    d      3.2   lemon  2020/1/7</span>

<span class="gp">In [12]: </span><span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">&#39;data/my_table.txt&#39;</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;col1&#39;</span><span class="p">,</span> <span class="s1">&#39;col2&#39;</span><span class="p">])</span>
<span class="gh">Out[12]: </span>
<span class="go">   col1 col2</span>
<span class="go">0     2    a</span>
<span class="go">1     3    b</span>
<span class="go">2     6    c</span>
<span class="go">3     5    d</span>

<span class="gp">In [13]: </span><span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/my_csv.csv&#39;</span><span class="p">,</span> <span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;col5&#39;</span><span class="p">])</span>
<span class="gh">Out[13]: </span>
<span class="go">   col1 col2  col3    col4       col5</span>
<span class="go">0     2    a   1.4   apple 2020-01-01</span>
<span class="go">1     3    b   3.4  banana 2020-01-02</span>
<span class="go">2     6    c   2.5  orange 2020-01-05</span>
<span class="go">3     5    d   3.2   lemon 2020-01-07</span>

<span class="gp">In [14]: </span><span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">&#39;data/my_excel.xlsx&#39;</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="gh">Out[14]: </span>
<span class="go">   col1 col2  col3    col4      col5</span>
<span class="go">0     2    a   1.4   apple  2020/1/1</span>
<span class="go">1     3    b   3.4  banana  2020/1/2</span>
</pre></div>
</div>
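<p>These parameters can also be combined in a single call. The following is a minimal sketch, not part of the original examples: the <code class="docutils literal notranslate"><span class="pre">csv_text</span></code> string is a made-up stand-in that mirrors <code class="docutils literal notranslate"><span class="pre">data/my_csv.csv</span></code>, and it is read through <code class="docutils literal notranslate"><span class="pre">io.StringIO</span></code> so that no file on disk is needed.</p>

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for data/my_csv.csv.
csv_text = (
    "col1,col2,col3,col4,col5\n"
    "2,a,1.4,apple,2020/1/1\n"
    "3,b,3.4,banana,2020/1/2\n"
    "6,c,2.5,orange,2020/1/5\n"
    "5,d,3.2,lemon,2020/1/7\n"
)

df_demo = pd.read_csv(
    io.StringIO(csv_text),
    index_col="col1",                  # use col1 as the row index
    usecols=["col1", "col3", "col5"],  # read only these three columns
    parse_dates=["col5"],              # convert col5 to datetimes
    nrows=3,                           # read only the first three data rows
)
print(df_demo)
print(df_demo.dtypes)
```

<p>Note that a column used in <code class="docutils literal notranslate"><span class="pre">index_col</span></code> must also be listed in <code class="docutils literal notranslate"><span class="pre">usecols</span></code> when both are given.</p>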
<p>When reading <code class="docutils literal notranslate"><span class="pre">txt</span></code> files, the separator is often not the tab character that <code class="docutils literal notranslate"><span class="pre">read_table</span></code> expects by default. The <code class="docutils literal notranslate"><span class="pre">sep</span></code> parameter lets you specify a custom separator for reading <code class="docutils literal notranslate"><span class="pre">txt</span></code> data. For example, the table read below uses <code class="docutils literal notranslate"><span class="pre">||||</span></code> as its separator:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [15]: </span><span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">&#39;data/my_table_special_sep.txt&#39;</span><span class="p">)</span>
<span class="gh">Out[15]: </span>
<span class="go">              col1 |||| col2</span>
<span class="go">0  TS |||| This is an apple.</span>
<span class="go">1    GQ |||| My name is Bob.</span>
<span class="go">2         WT |||| Well done!</span>
<span class="go">3    PT |||| May I help you?</span>
</pre></div>
</div>
<p>The result above is clearly not what we want. In this case, pass <code class="docutils literal notranslate"><span class="pre">sep</span></code> and also set the engine to <code class="docutils literal notranslate"><span class="pre">python</span></code>:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [16]: </span><span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">&#39;data/my_table_special_sep.txt&#39;</span><span class="p">,</span>
<span class="gp">   ....: </span>              <span class="n">sep</span><span class="o">=</span><span class="sa">r</span><span class="s1">&#39; \|\|\|\| &#39;</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s1">&#39;python&#39;</span><span class="p">)</span>
<span class="gp">   ....: </span>
<span class="gh">Out[16]: </span>
<span class="go">  col1               col2</span>
<span class="go">0   TS  This is an apple.</span>
<span class="go">1   GQ    My name is Bob.</span>
<span class="go">2   WT         Well done!</span>
<span class="go">3   PT    May I help you?</span>
</pre></div>
</div>
<div class="caution admonition">
<p class="admonition-title"><code class="docutils literal notranslate"><span class="pre">sep</span></code> is a regular-expression parameter</p>
<blockquote>
<div><p>When using <code class="docutils literal notranslate"><span class="pre">read_table</span></code>, note that a <code class="docutils literal notranslate"><span class="pre">sep</span></code> longer than one character is interpreted as a regular expression, so <code class="docutils literal notranslate"><span class="pre">|</span></code> must be escaped as <code class="docutils literal notranslate"><span class="pre">\|</span></code>; otherwise the data will not be read correctly. For the basics of regular expressions, see Chapter 8 or other related material.</p>
</div></blockquote>
</div>
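<p>As a self-contained illustration of the escaping rule, the sketch below reads a few lines from an in-memory string instead of <code class="docutils literal notranslate"><span class="pre">data/my_table_special_sep.txt</span></code>. The <code class="docutils literal notranslate"><span class="pre">text</span></code> variable is a made-up stand-in for that file, and the pattern is written as a raw string so that Python leaves the backslashes untouched.</p>

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for data/my_table_special_sep.txt.
text = (
    "col1 |||| col2\n"
    "TS |||| This is an apple.\n"
    "GQ |||| My name is Bob.\n"
)

# In the raw string, each \| matches one literal "|" character.
df_sep = pd.read_table(io.StringIO(text), sep=r" \|\|\|\| ", engine="python")
print(df_sep)
```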
</section>
<section id="id3">
<h3>2. Writing Data<a class="headerlink" href="#id3" title="Permalink to this heading">#</a></h3>
<p>When writing data, the most common step is to set <code class="docutils literal notranslate"><span class="pre">index</span></code> to <code class="docutils literal notranslate"><span class="pre">False</span></code>, especially when the index carries no special meaning; this drops the index from the saved file.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [17]: </span><span class="n">df_csv</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;data/my_csv_saved.csv&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>

<span class="gp">In [18]: </span><span class="n">df_excel</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s1">&#39;data/my_excel_saved.xlsx&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</pre></div>
</div>
<p><code class="docutils literal notranslate"><span class="pre">pandas</span></code> does not define a <code class="docutils literal notranslate"><span class="pre">to_table</span></code> function, but <code class="docutils literal notranslate"><span class="pre">to_csv</span></code> can save to a <code class="docutils literal notranslate"><span class="pre">txt</span></code> file and accepts a custom separator; the tab character <code class="docutils literal notranslate"><span class="pre">\t</span></code> is a common choice:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [19]: </span><span class="n">df_txt</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">&#39;data/my_txt_saved.txt&#39;</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\t</span><span class="s1">&#39;</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</pre></div>
</div>
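<p>A quick way to check a tab-separated export is to write it out and read it back. The sketch below does this with an in-memory <code class="docutils literal notranslate"><span class="pre">io.StringIO</span></code> buffer; the small <code class="docutils literal notranslate"><span class="pre">df_demo</span></code> table is invented for illustration and is not one of the data files used in this chapter.</p>

```python
import io

import pandas as pd

# Small made-up table used only for this round trip.
df_demo = pd.DataFrame({"col1": [2, 3], "col2": ["a", "b"]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_demo.to_csv(buffer, sep="\t", index=False)
print(repr(buffer.getvalue()))

# read_table uses the tab character as its default separator,
# so the text can be read back without passing sep.
buffer.seek(0)
df_back = pd.read_table(buffer)
print(df_back.equals(df_demo))
```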
<p>To quickly convert a table to <code class="docutils literal notranslate"><span class="pre">markdown</span></code> or <code class="docutils literal notranslate"><span class="pre">latex</span></code>, use the <code class="docutils literal notranslate"><span class="pre">to_markdown</span></code> and <code class="docutils literal notranslate"><span class="pre">to_latex</span></code> functions. <code class="docutils literal notranslate"><span class="pre">to_markdown</span></code> requires the <code class="docutils literal notranslate"><span class="pre">tabulate</span></code> package to be installed.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [20]: </span><span class="nb">print</span><span class="p">(</span><span class="n">df_csv</span><span class="o">.</span><span class="n">to_markdown</span><span class="p">())</span>
<span class="go">|    |   col1 | col2   |   col3 | col4   | col5     |</span>
<span class="go">|---:|-------:|:-------|-------:|:-------|:---------|</span>
<span class="go">|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |</span>
<span class="go">|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |</span>
<span class="go">|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |</span>
<span class="go">|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |</span>

<span class="gp">In [21]: </span><span class="nb">print</span><span class="p">(</span><span class="n">df_csv</span><span class="o">.</span><span class="n">to_latex</span><span class="p">())</span>
<span class="go">\begin{tabular}{lrlrll}</span>
<span class="go">\toprule</span>
<span class="go">{} &amp;  col1 &amp; col2 &amp;  col3 &amp;    col4 &amp;      col5 \\</span>
<span class="go">\midrule</span>
<span class="go">0 &amp;     2 &amp;    a &amp;   1.4 &amp;   apple &amp;  2020/1/1 \\</span>
<span class="go">1 &amp;     3 &amp;    b &amp;   3.4 &amp;  banana &amp;  2020/1/2 \\</span>
<span class="go">2 &amp;     6 &amp;    c &amp;   2.5 &amp;  orange &amp;  2020/1/5 \\</span>
<span class="go">3 &amp;     5 &amp;    d &amp;   3.2 &amp;   lemon &amp;  2020/1/7 \\</span>
<span class="go">\bottomrule</span>
<span class="go">\end{tabular}</span>
</pre></div>
</div>
</section>
</section>
<section id="id4">
<h2>II. Basic Data Structures<a class="headerlink" href="#id4" title="Permalink to this heading">#</a></h2>
<p><code class="docutils literal notranslate"><span class="pre">pandas</span></code> has two basic data storage structures: <code class="docutils literal notranslate"><span class="pre">Series</span></code>, which stores one-dimensional <code class="docutils literal notranslate"><span class="pre">values</span></code>, and <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>, which stores two-dimensional <code class="docutils literal notranslate"><span class="pre">values</span></code>. Many attributes and methods are defined on both structures.</p>
<section id="series">
<h3>1. Series<a class="headerlink" href="#series" title="Permalink to this heading">#</a></h3>
<p>A <code class="docutils literal notranslate"><span class="pre">Series</span></code> generally consists of four parts: the values of the sequence <code class="docutils literal notranslate"><span class="pre">data</span></code>, the index <code class="docutils literal notranslate"><span class="pre">index</span></code>, the storage type <code class="docutils literal notranslate"><span class="pre">dtype</span></code>, and the name of the sequence <code class="docutils literal notranslate"><span class="pre">name</span></code>. The index can also be given its own name, which is empty by default.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [22]: </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="p">{</span><span class="s1">&#39;dic1&#39;</span><span class="p">:</span><span class="mi">5</span><span class="p">}],</span>
<span class="gp">   ....: </span>              <span class="n">index</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Index</span><span class="p">([</span><span class="s1">&#39;id1&#39;</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="s1">&#39;third&#39;</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s1">&#39;my_idx&#39;</span><span class="p">),</span>
<span class="gp">   ....: </span>              <span class="n">dtype</span> <span class="o">=</span> <span class="s1">&#39;object&#39;</span><span class="p">,</span>
<span class="gp">   ....: </span>              <span class="n">name</span> <span class="o">=</span> <span class="s1">&#39;my_name&#39;</span><span class="p">)</span>
<span class="gp">   ....: </span>

<span class="gp">In [23]: </span><span class="n">s</span>
<span class="gh">Out[23]: </span>
<span class="go">my_idx</span>
<span class="go">id1              100</span>
<span class="go">20                 a</span>
<span class="go">third    {&#39;dic1&#39;: 5}</span>
<span class="go">Name: my_name, dtype: object</span>
</pre></div>
</div>
<div class="note admonition">
<p class="admonition-title">The <code class="docutils literal notranslate"><span class="pre">object</span></code> type</p>
<blockquote>
<div><p><code class="docutils literal notranslate"><span class="pre">object</span></code> represents a mixed type; the example above stores an integer, a string, and a <code class="docutils literal notranslate"><span class="pre">Python</span></code> dictionary. In addition, <code class="docutils literal notranslate"><span class="pre">pandas</span></code> currently treats a sequence of pure strings as an <code class="docutils literal notranslate"><span class="pre">object</span></code> sequence by default, although it can also be stored with the <code class="docutils literal notranslate"><span class="pre">string</span></code> type. Text sequences are discussed in Chapter 8.</p>
</div></blockquote>
</div>
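<p>The difference between the two storage types can be seen in a short sketch. The variable names <code class="docutils literal notranslate"><span class="pre">s_mixed</span></code> and <code class="docutils literal notranslate"><span class="pre">s_text</span></code> are illustrative, and the <code class="docutils literal notranslate"><span class="pre">string</span></code> type is requested explicitly rather than relying on the default inference, which may differ between pandas versions.</p>

```python
import pandas as pd

# Mixed Python objects can only be stored with the object dtype.
s_mixed = pd.Series([100, "a", {"dic1": 5}])

# Pure text can be stored with the dedicated string dtype on request.
s_text = pd.Series(["apple", "banana"], dtype="string")

print(s_mixed.dtype)   # object
print(s_text.dtype)    # string

# Each element of the object Series keeps its original Python type.
print([type(v).__name__ for v in s_mixed])   # ['int', 'str', 'dict']
```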
<p>These attributes can be accessed with the <code class="docutils literal notranslate"><span class="pre">.</span></code> notation:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [24]: </span><span class="n">s</span><span class="o">.</span><span class="n">values</span>
<span class="gh">Out[24]: </span><span class="go">array([100, &#39;a&#39;, {&#39;dic1&#39;: 5}], dtype=object)</span>

<span class="gp">In [25]: </span><span class="n">s</span><span class="o">.</span><span class="n">index</span>
<span class="gh">Out[25]: </span><span class="go">Index([&#39;id1&#39;, 20, &#39;third&#39;], dtype=&#39;object&#39;, name=&#39;my_idx&#39;)</span>

<span class="gp">In [26]: </span><span class="n">s</span><span class="o">.</span><span class="n">dtype</span>
<span class="gh">Out[26]: </span><span class="go">dtype(&#39;O&#39;)</span>

<span class="gp">In [27]: </span><span class="n">s</span><span class="o">.</span><span class="n">name</span>
<span class="gh">Out[27]: </span><span class="go">&#39;my_name&#39;</span>
</pre></div>
</div>
<p><code class="docutils literal notranslate"><span class="pre">.shape</span></code> returns a tuple whose single entry is the length of the sequence:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [28]: </span><span class="n">s</span><span class="o">.</span><span class="n">shape</span>
<span class="gh">Out[28]: </span><span class="go">(3,)</span>
</pre></div>
</div>
<p>The index is one of the most important concepts in <code class="docutils literal notranslate"><span class="pre">pandas</span></code> and is discussed in detail in Chapter 3. To retrieve the value for a single index label, use <code class="docutils literal notranslate"><span class="pre">[index_item]</span></code>.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [29]: </span><span class="n">s</span><span class="p">[</span><span class="s1">&#39;third&#39;</span><span class="p">]</span>
<span class="gh">Out[29]: </span><span class="go">{&#39;dic1&#39;: 5}</span>
</pre></div>
</div>
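<p>The attribute access, <code class="docutils literal notranslate"><span class="pre">.shape</span></code>, and label lookup described above can be tried together on a small Series. This is a minimal sketch; <code class="docutils literal notranslate"><span class="pre">s_demo</span></code> and its values are invented for illustration.</p>

```python
import pandas as pd

# Made-up Series with a named index and a name of its own.
s_demo = pd.Series([1.5, 2.5, 3.5],
                   index=pd.Index(["a", "b", "c"], name="my_idx"),
                   name="my_name")

print(s_demo.shape)       # (3,)
print(len(s_demo))        # 3
print(s_demo.index.name)  # my_idx
print(s_demo.name)        # my_name
print(s_demo["b"])        # 2.5
```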
</section>
<section id="dataframe">
<h3>2. DataFrame<a class="headerlink" href="#dataframe" title="Permalink to this heading">#</a></h3>
<p>A <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> adds a column index on top of <code class="docutils literal notranslate"><span class="pre">Series</span></code>. A DataFrame can be constructed from two-dimensional <code class="docutils literal notranslate"><span class="pre">data</span></code> together with row and column indexes:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [30]: </span><span class="n">data</span> <span class="o">=</span> <span class="p">[[</span><span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="mf">1.2</span><span class="p">],</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">,</span> <span class="mf">2.2</span><span class="p">],</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="s1">&#39;c&#39;</span><span class="p">,</span> <span class="mf">3.2</span><span class="p">]]</span>

<span class="gp">In [31]: </span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">,</span>
<span class="gp">   ....: </span>                  <span class="n">index</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;row_</span><span class="si">%d</span><span class="s1">&#39;</span><span class="o">%</span><span class="k">i</span> for i in range(3)],
<span class="gp">   ....: </span>                  <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;col_0&#39;</span><span class="p">,</span> <span class="s1">&#39;col_1&#39;</span><span class="p">,</span> <span class="s1">&#39;col_2&#39;</span><span class="p">])</span>
<span class="gp">   ....: </span>

<span class="gp">In [32]: </span><span class="n">df</span>
<span class="gh">Out[32]: </span>
<span class="go">       col_0 col_1  col_2</span>
<span class="go">row_0      1     a    1.2</span>
<span class="go">row_1      2     b    2.2</span>
<span class="go">row_2      3     c    3.2</span>
</pre></div>
</div>
<p>More commonly, however, a data frame is constructed from a mapping of column names to data, with the row index supplied in addition:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [33]: </span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;col_0&#39;</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">],</span> <span class="s1">&#39;col_1&#39;</span><span class="p">:</span><span class="nb">list</span><span class="p">(</span><span class="s1">&#39;abc&#39;</span><span class="p">),</span>
<span class="gp">   ....: </span>                          <span class="s1">&#39;col_2&#39;</span><span class="p">:</span> <span class="p">[</span><span class="mf">1.2</span><span class="p">,</span> <span class="mf">2.2</span><span class="p">,</span> <span class="mf">3.2</span><span class="p">]},</span>
<span class="gp">   ....: </span>                  <span class="n">index</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;row_</span><span class="si">%d</span><span class="s1">&#39;</span><span class="o">%</span><span class="k">i</span> for i in range(3)])
<span class="gp">   ....: </span>

<span class="gp">In [34]: </span><span class="n">df</span>
<span class="gh">Out[34]: </span>
<span class="go">       col_0 col_1  col_2</span>
<span class="go">row_0      1     a    1.2</span>
<span class="go">row_1      2     b    2.2</span>
<span class="go">row_2      3     c    3.2</span>
</pre></div>
</div>
<p>Because of this mapping, <code class="docutils literal notranslate"><span class="pre">[col_name]</span></code> and <code class="docutils literal notranslate"><span class="pre">[col_list]</span></code> can be used on a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> to select a single column or a table made up of several columns; the results are a <code class="docutils literal notranslate"><span class="pre">Series</span></code> and a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>, respectively:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [35]: </span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;col_0&#39;</span><span class="p">]</span>
<span class="gh">Out[35]: </span>
<span class="go">row_0    1</span>
<span class="go">row_1    2</span>
<span class="go">row_2    3</span>
<span class="go">Name: col_0, dtype: int64</span>

<span class="gp">In [36]: </span><span class="n">df</span><span class="p">[[</span><span class="s1">&#39;col_0&#39;</span><span class="p">,</span> <span class="s1">&#39;col_1&#39;</span><span class="p">]]</span>
<span class="gh">Out[36]: </span>
<span class="go">       col_0 col_1</span>
<span class="go">row_0      1     a</span>
<span class="go">row_1      2     b</span>
<span class="go">row_2      3     c</span>
</pre></div>
</div>
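<p>The distinction between the two selectors can be checked directly. The following sketch, which uses a small hypothetical frame <code class="docutils literal notranslate"><span class="pre">df_demo</span></code>, suggests that even a one-element list still yields a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>, whereas a bare column name yields a <code class="docutils literal notranslate"><span class="pre">Series</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pandas as pd

# Hypothetical frame used only for this illustration.
df_demo = pd.DataFrame({'col_0': [1, 2, 3], 'col_1': list('abc')},
                       index=['row_0', 'row_1', 'row_2'])

single = df_demo['col_0']      # column name: one-dimensional Series
wrapped = df_demo[['col_0']]   # list of names: two-dimensional DataFrame

print(type(single).__name__, single.shape)
print(type(wrapped).__name__, wrapped.shape)
</pre></div>
</div>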
<p>As with <code class="docutils literal notranslate"><span class="pre">Series</span></code>, the corresponding attributes can be retrieved from a data frame:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [37]: </span><span class="n">df</span><span class="o">.</span><span class="n">values</span>
<span class="gh">Out[37]: </span>
<span class="go">array([[1, &#39;a&#39;, 1.2],</span>
<span class="go">       [2, &#39;b&#39;, 2.2],</span>
<span class="go">       [3, &#39;c&#39;, 3.2]], dtype=object)</span>

<span class="gp">In [38]: </span><span class="n">df</span><span class="o">.</span><span class="n">index</span>
<span class="gh">Out[38]: </span><span class="go">Index([&#39;row_0&#39;, &#39;row_1&#39;, &#39;row_2&#39;], dtype=&#39;object&#39;)</span>

<span class="gp">In [39]: </span><span class="n">df</span><span class="o">.</span><span class="n">columns</span>
<span class="gh">Out[39]: </span><span class="go">Index([&#39;col_0&#39;, &#39;col_1&#39;, &#39;col_2&#39;], dtype=&#39;object&#39;)</span>

<span class="gp">In [40]: </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span> <span class="c1"># returns a Series whose values are the dtypes of the corresponding columns</span>
<span class="gh">Out[40]: </span>
<span class="go">col_0      int64</span>
<span class="go">col_1     object</span>
<span class="go">col_2    float64</span>
<span class="go">dtype: object</span>

<span class="gp">In [41]: </span><span class="n">df</span><span class="o">.</span><span class="n">shape</span>
<span class="gh">Out[41]: </span><span class="go">(3, 3)</span>
</pre></div>
</div>
<p>A <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> can be transposed with <code class="docutils literal notranslate"><span class="pre">.T</span></code>:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [42]: </span><span class="n">df</span><span class="o">.</span><span class="n">T</span>
<span class="gh">Out[42]: </span>
<span class="go">      row_0 row_1 row_2</span>
<span class="go">col_0     1     2     3</span>
<span class="go">col_1     a     b     c</span>
<span class="go">col_2   1.2   2.2   3.2</span>
</pre></div>
</div>
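<p>One consequence of transposing a frame whose columns have different types is worth a quick check. The sketch below rebuilds a comparable frame under the hypothetical name <code class="docutils literal notranslate"><span class="pre">df_demo</span></code> and inspects the labels and dtypes after <code class="docutils literal notranslate"><span class="pre">.T</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pandas as pd

# Hypothetical frame used only for this illustration.
df_demo = pd.DataFrame({'col_0': [1, 2, 3], 'col_1': list('abc'),
                        'col_2': [1.2, 2.2, 3.2]},
                       index=['row_0', 'row_1', 'row_2'])

df_t = df_demo.T

# Row and column labels swap places.
print(list(df_t.index))
print(list(df_t.columns))

# Each transposed column now mixes int, str and float values,
# so it is stored with the generic object dtype.
print(df_t.dtypes)
</pre></div>
</div>
<p>Because the per-column numeric dtypes are lost in this case, transposing twice does not necessarily restore the original dtypes.</p>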
</section>
</section>
<section id="id5">
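<h2>III. Common Basic Functions<a class="headerlink" href="#id5" title="Permalink to this heading">#</a></h2>
<p>For illustration, the rest of this chapter and the remaining chapters use a synthetic dataset, <code class="docutils literal notranslate"><span class="pre">learn_pandas.csv</span></code>, which records physical fitness test information for students at four schools.</p>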
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [43]: </span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/learn_pandas.csv&#39;</span><span class="p">)</span>

<span class="gp">In [44]: </span><span class="n">df</span><span class="o">.</span><span class="n">columns</span>
<span class="gh">Out[44]: </span>
<span class="go">Index([&#39;School&#39;, &#39;Grade&#39;, &#39;Name&#39;, &#39;Gender&#39;, &#39;Height&#39;, &#39;Weight&#39;, &#39;Transfer&#39;,</span>
<span class="go">       &#39;Test_Number&#39;, &#39;Test_Date&#39;, &#39;Time_Record&#39;],</span>
<span class="go">      dtype=&#39;object&#39;)</span>
</pre></div>
</div>
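<p>These column names stand for, in order: school, grade, name, gender, height, weight, whether the student is a transfer student, test session number, test date, and 1000-meter result. This chapter only needs the first seven columns.</p>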
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [45]: </span><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">[:</span><span class="mi">7</span><span class="p">]]</span>
</pre></div>
</div>
<section id="id6">
<h3>1. Summary Functions<a class="headerlink" href="#id6" title="Permalink to this heading">#</a></h3>
<p>The <code class="docutils literal notranslate"><span class="pre">head,</span> <span class="pre">tail</span></code> functions return the first <code class="docutils literal notranslate"><span class="pre">n</span></code> rows and the last <code class="docutils literal notranslate"><span class="pre">n</span></code> rows of a table or sequence, respectively, where <code class="docutils literal notranslate"><span class="pre">n</span></code> defaults to 5:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [46]: </span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="gh">Out[46]: </span>
<span class="go">                          School     Grade            Name  Gender  Height  Weight Transfer</span>
<span class="go">0  Shanghai Jiao Tong University  Freshman    Gaopeng Yang  Female   158.9    46.0        N</span>
<span class="go">1              Peking University  Freshman  Changqiang You    Male   166.5    70.0        N</span>

<span class="gp">In [47]: </span><span class="n">df</span><span class="o">.</span><span class="n">tail</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
<span class="gh">Out[47]: </span>
<span class="go">                            School      Grade            Name  Gender  Height  Weight Transfer</span>
<span class="go">197  Shanghai Jiao Tong University     Senior  Chengqiang Chu  Female   153.9    45.0        N</span>
<span class="go">198  Shanghai Jiao Tong University     Senior   Chengmei Shen    Male   175.3    71.0        N</span>
<span class="go">199            Tsinghua University  Sophomore     Chunpeng Lv    Male   155.7    51.0        N</span>
</pre></div>
</div>
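<p>The default of five rows can be confirmed on a small stand-in frame. The frame <code class="docutils literal notranslate"><span class="pre">df_demo</span></code> below is hypothetical and only serves to make the sketch self-contained:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pandas as pd

# Hypothetical ten-row frame used only for this illustration.
df_demo = pd.DataFrame({'x': range(10)})

first_default = df_demo.head()   # no argument: n defaults to 5
last_three = df_demo.tail(3)     # explicit n

print(len(first_default))
print(list(last_three.index))
</pre></div>
</div>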
<p><code class="docutils literal notranslate"><span class="pre">info,</span> <span class="pre">describe</span></code> give an <span class="red">overview of the table</span> and the <span class="red">main statistics of its numeric columns</span>, respectively:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [48]: </span><span class="n">df</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
<span class="go">&lt;class &#39;pandas.core.frame.DataFrame&#39;&gt;</span>
<span class="go">RangeIndex: 200 entries, 0 to 199</span>
<span class="go">Data columns (total 7 columns):</span>
<span class="go"> #   Column    Non-Null Count  Dtype  </span>
<span class="go">---  ------    --------------  -----  </span>
<span class="go"> 0   School    200 non-null    object </span>
<span class="go"> 1   Grade     200 non-null    object </span>
<span class="go"> 2   Name      200 non-null    object </span>
<span class="go"> 3   Gender    200 non-null    object </span>
<span class="go"> 4   Height    183 non-null    float64</span>
<span class="go"> 5   Weight    189 non-null    float64</span>
<span class="go"> 6   Transfer  188 non-null    object </span>
<span class="go">dtypes: float64(2), object(5)</span>
<span class="go">memory usage: 11.1+ KB</span>

<span class="gp">In [49]: </span><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
<span class="gh">Out[49]: </span>
<span class="go">           Height      Weight</span>
<span class="go">count  183.000000  189.000000</span>
<span class="go">mean   163.218033   55.015873</span>
<span class="go">std      8.608879   12.824294</span>
<span class="go">min    145.400000   34.000000</span>
<span class="go">25%    157.150000   46.000000</span>
<span class="go">50%    161.900000   51.000000</span>
<span class="go">75%    167.500000   65.000000</span>
<span class="go">max    193.900000   89.000000</span>
</pre></div>
</div>
<div class="note admonition">
<p class="admonition-title">More comprehensive data summaries</p>
<blockquote>
<div><p><code class="docutils literal notranslate"><span class="pre">info,</span> <span class="pre">describe</span></code> only display a limited amount of information. For a thorough and effective look at a dataset, especially one with many columns, the <a class="reference external" href="https://pandas-profiling.github.io/pandas-profiling/docs/master/index.html">pandas-profiling</a> package is recommended; it is mentioned again in Chapter 11.</p>
</div></blockquote>
</div>
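<p>Before reaching for an extra package, note that <code class="docutils literal notranslate"><span class="pre">describe</span></code> itself can be asked to cover non-numeric columns through its <code class="docutils literal notranslate"><span class="pre">include</span></code> parameter. The sketch below uses a hypothetical four-row frame <code class="docutils literal notranslate"><span class="pre">df_demo</span></code> whose column names merely imitate the dataset:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pandas as pd

# Hypothetical frame used only for this illustration.
df_demo = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                        'Gender': ['Female', 'Male', 'Male', 'Female'],
                        'Height': [158.9, 166.5, float('nan'), 175.3],
                        'Weight': [46.0, 70.0, 89.0, 71.0]})

numeric_summary = df_demo.describe()             # numeric columns only
full_summary = df_demo.describe(include='all')   # every column

print(list(numeric_summary.columns))
print(numeric_summary.loc['count'])   # missing values are not counted
print(list(full_summary.index))       # adds rows such as unique, top, freq
</pre></div>
</div>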
</section>
<section id="id7">
<h3>2. Descriptive Statistics Functions<a class="headerlink" href="#id7" title="Permalink to this heading">#</a></h3>
<p>Many statistical functions are defined on <code class="docutils literal notranslate"><span class="pre">Series</span></code> and <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>; the most common are <code class="docutils literal notranslate"><span class="pre">sum,</span> <span class="pre">mean,</span> <span class="pre">median,</span> <span class="pre">var,</span> <span class="pre">std,</span> <span class="pre">max,</span> <span class="pre">min</span></code>. For example, select the height and weight columns for a demonstration:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [50]: </span><span class="n">df_demo</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s1">&#39;Height&#39;</span><span class="p">,</span> <span class="s1">&#39;Weight&#39;</span><span class="p">]]</span>

<span class="gp">In [51]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="gh">Out[51]: </span>
<span class="go">Height    163.218033</span>
<span class="go">Weight     55.015873</span>
<span class="go">dtype: float64</span>

<span class="gp">In [52]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
<span class="gh">Out[52]: </span>
<span class="go">Height    193.9</span>
<span class="go">Weight     89.0</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
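<p>In addition, three functions deserve mention: <code class="docutils literal notranslate"><span class="pre">quantile,</span> <span class="pre">count,</span> <span class="pre">idxmax</span></code>, which return a quantile, the number of non-missing values, and the index of the maximum value, respectively:</p>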
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [53]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.75</span><span class="p">)</span>
<span class="gh">Out[53]: </span>
<span class="go">Height    167.5</span>
<span class="go">Weight     65.0</span>
<span class="go">Name: 0.75, dtype: float64</span>

<span class="gp">In [54]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="gh">Out[54]: </span>
<span class="go">Height    183</span>
<span class="go">Weight    189</span>
<span class="go">dtype: int64</span>

<span class="gp">In [55]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">idxmax</span><span class="p">()</span> <span class="c1"># idxmin is the counterpart for the minimum</span>
<span class="gh">Out[55]: </span>
<span class="go">Height    193</span>
<span class="go">Weight      2</span>
<span class="go">dtype: int64</span>
</pre></div>
</div>
<p>Because each of the functions above reduces a sequence to a scalar, they are also called aggregation functions. They share a common parameter <code class="docutils literal notranslate"><span class="pre">axis</span></code>: the default 0 aggregates column by column, while 1 aggregates row by row:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [56]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span> <span class="c1"># averaging height and weight is not meaningful for this dataset</span>
<span class="gh">Out[56]: </span>
<span class="go">0    102.45</span>
<span class="go">1    118.25</span>
<span class="go">2    138.95</span>
<span class="go">3     41.00</span>
<span class="go">4    124.00</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
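<p>The effect of <code class="docutils literal notranslate"><span class="pre">axis</span></code> is easiest to see on a frame small enough to verify by hand. The two-row frame below is a hypothetical example, not part of the dataset:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pandas as pd

# Hypothetical frame used only for this illustration.
df_demo = pd.DataFrame({'Height': [160.0, 170.0],
                        'Weight': [50.0, 70.0]})

by_column = df_demo.mean()        # axis=0: one value per column
by_row = df_demo.mean(axis=1)     # axis=1: one value per row

print(by_column)   # indexed by the column names
print(by_row)      # indexed by the row labels
</pre></div>
</div>
<p>With <code class="docutils literal notranslate"><span class="pre">axis=0</span></code> the result is indexed by column names, and with <code class="docutils literal notranslate"><span class="pre">axis=1</span></code> it is indexed by the original row labels.</p>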
</section>
<section id="id8">
<h3>3. Unique-Value Functions<a class="headerlink" href="#id8" title="Permalink to this heading">#</a></h3>
<p>Calling <code class="docutils literal notranslate"><span class="pre">unique</span></code> and <code class="docutils literal notranslate"><span class="pre">nunique</span></code> on a sequence returns the array of its unique values and the number of unique values, respectively:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [57]: </span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;School&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>
<span class="gh">Out[57]: </span>
<span class="go">array([&#39;Shanghai Jiao Tong University&#39;, &#39;Peking University&#39;,</span>
<span class="go">       &#39;Fudan University&#39;, &#39;Tsinghua University&#39;], dtype=object)</span>

<span class="gp">In [58]: </span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;School&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
<span class="gh">Out[58]: </span><span class="go">4</span>
</pre></div>
</div>
<p><code class="docutils literal notranslate"><span class="pre">value_counts</span></code> returns the unique values together with their frequencies:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [59]: </span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;School&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
<span class="gh">Out[59]: </span>
<span class="go">Tsinghua University              69</span>
<span class="go">Shanghai Jiao Tong University    57</span>
<span class="go">Fudan University                 40</span>
<span class="go">Peking University                34</span>
<span class="go">Name: School, dtype: int64</span>
</pre></div>
</div>
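<p>When proportions are more useful than raw counts, <code class="docutils literal notranslate"><span class="pre">value_counts</span></code> accepts <code class="docutils literal notranslate"><span class="pre">normalize=True</span></code>. A minimal sketch on a hypothetical Series <code class="docutils literal notranslate"><span class="pre">s_demo</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pandas as pd

# Hypothetical Series used only for this illustration.
s_demo = pd.Series(['a', 'b', 'a', 'a', 'c', 'b'])

counts = s_demo.value_counts()                 # absolute frequencies
shares = s_demo.value_counts(normalize=True)   # relative frequencies

print(counts)
print(shares)
</pre></div>
</div>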
<p>To inspect the unique combinations across several columns, use <code class="docutils literal notranslate"><span class="pre">drop_duplicates</span></code>. Its key parameter is <code class="docutils literal notranslate"><span class="pre">keep</span></code>: the default <code class="docutils literal notranslate"><span class="pre">first</span></code> keeps the row where each combination first appears, <code class="docutils literal notranslate"><span class="pre">last</span></code> keeps the row where it last appears, and <code class="docutils literal notranslate"><span class="pre">False</span></code> drops every row whose combination is duplicated.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [60]: </span><span class="n">df_demo</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s1">&#39;Gender&#39;</span><span class="p">,</span><span class="s1">&#39;Transfer&#39;</span><span class="p">,</span><span class="s1">&#39;Name&#39;</span><span class="p">]]</span>

<span class="gp">In [61]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">([</span><span class="s1">&#39;Gender&#39;</span><span class="p">,</span> <span class="s1">&#39;Transfer&#39;</span><span class="p">])</span>
<span class="gh">Out[61]: </span>
<span class="go">    Gender Transfer            Name</span>
<span class="go">0   Female        N    Gaopeng Yang</span>
<span class="go">1     Male        N  Changqiang You</span>
<span class="go">12  Female      NaN        Peng You</span>
<span class="go">21    Male      NaN   Xiaopeng Shen</span>
<span class="go">36    Male        Y    Xiaojuan Qin</span>
<span class="go">43  Female        Y      Gaoli Feng</span>

<span class="gp">In [62]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">([</span><span class="s1">&#39;Gender&#39;</span><span class="p">,</span> <span class="s1">&#39;Transfer&#39;</span><span class="p">],</span> <span class="n">keep</span><span class="o">=</span><span class="s1">&#39;last&#39;</span><span class="p">)</span>
<span class="gh">Out[62]: </span>
<span class="go">     Gender Transfer            Name</span>
<span class="go">147    Male      NaN        Juan You</span>
<span class="go">150    Male        Y   Chengpeng You</span>
<span class="go">169  Female        Y   Chengquan Qin</span>
<span class="go">194  Female      NaN     Yanmei Qian</span>
<span class="go">197  Female        N  Chengqiang Chu</span>
<span class="go">199    Male        N     Chunpeng Lv</span>

<span class="gp">In [63]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">([</span><span class="s1">&#39;Name&#39;</span><span class="p">,</span> <span class="s1">&#39;Gender&#39;</span><span class="p">],</span>
<span class="gp">   ....: </span>                     <span class="n">keep</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span> <span class="c1"># keep only the name and gender combinations that occur exactly once</span>
<span class="gp">   ....: </span>
<span class="gh">Out[63]: </span>
<span class="go">   Gender Transfer            Name</span>
<span class="go">0  Female        N    Gaopeng Yang</span>
<span class="go">1    Male        N  Changqiang You</span>
<span class="go">2    Male        N         Mei Sun</span>
<span class="go">4    Male        N     Gaojuan You</span>
<span class="go">5  Female        N     Xiaoli Qian</span>

<span class="gp">In [64]: </span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;School&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">()</span> <span class="c1"># also works on a Series</span>
<span class="gh">Out[64]: </span>
<span class="go">0    Shanghai Jiao Tong University</span>
<span class="go">1                Peking University</span>
<span class="go">3                 Fudan University</span>
<span class="go">5              Tsinghua University</span>
<span class="go">Name: School, dtype: object</span>
</pre></div>
</div>
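<p>The three <code class="docutils literal notranslate"><span class="pre">keep</span></code> settings are easier to compare on a frame with a single repeated combination. In the hypothetical frame below, only the pair (Female, N) occurs twice, at rows 0 and 2:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pandas as pd

# Hypothetical frame used only for this illustration.
df_demo = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'Transfer': ['N', 'N', 'N', 'Y', 'Y'],
})
cols = ['Gender', 'Transfer']

kept_first = df_demo.drop_duplicates(cols)               # keeps row 0, drops row 2
kept_last = df_demo.drop_duplicates(cols, keep='last')   # keeps row 2, drops row 0
kept_unique = df_demo.drop_duplicates(cols, keep=False)  # drops rows 0 and 2

print(list(kept_first.index))
print(list(kept_last.index))
print(list(kept_unique.index))
</pre></div>
</div>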
<p>In addition, <code class="docutils literal notranslate"><span class="pre">duplicated</span></code> works like <code class="docutils literal notranslate"><span class="pre">drop_duplicates</span></code>, but it returns a boolean sequence indicating whether each row is a duplicate, and its <code class="docutils literal notranslate"><span class="pre">keep</span></code> parameter has the same meaning. In the returned sequence, duplicated elements are marked <code class="docutils literal notranslate"><span class="pre">True</span></code> and all others <code class="docutils literal notranslate"><span class="pre">False</span></code>. <code class="docutils literal notranslate"><span class="pre">drop_duplicates</span></code> is equivalent to removing the rows for which <code class="docutils literal notranslate"><span class="pre">duplicated</span></code> is <code class="docutils literal notranslate"><span class="pre">True</span></code>.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [65]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">duplicated</span><span class="p">([</span><span class="s1">&#39;Gender&#39;</span><span class="p">,</span> <span class="s1">&#39;Transfer&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[65]: </span>
<span class="go">0    False</span>
<span class="go">1    False</span>
<span class="go">2     True</span>
<span class="go">3     True</span>
<span class="go">4     True</span>
<span class="go">dtype: bool</span>

<span class="gp">In [66]: </span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;School&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">duplicated</span><span class="p">()</span><span class="o">.</span><span class="n">head</span><span class="p">()</span> <span class="c1"># also works on a Series</span>
<span class="gh">Out[66]: </span>
<span class="go">0    False</span>
<span class="go">1    False</span>
<span class="go">2     True</span>
<span class="go">3    False</span>
<span class="go">4     True</span>
<span class="go">Name: School, dtype: bool</span>
</pre></div>
</div>
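<p>The stated equivalence between the two functions can be verified for each <code class="docutils literal notranslate"><span class="pre">keep</span></code> setting. The following sketch uses a hypothetical frame and collects the comparison results in a dictionary named <code class="docutils literal notranslate"><span class="pre">matches</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pandas as pd

# Hypothetical frame used only for this illustration.
df_demo = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'Transfer': ['N', 'N', 'N', 'Y', 'Y'],
})
cols = ['Gender', 'Transfer']

matches = {}
for keep in ['first', 'last', False]:
    # Remove the rows flagged True by duplicated ...
    via_mask = df_demo[~df_demo.duplicated(cols, keep=keep)]
    # ... and compare with the direct call.
    via_drop = df_demo.drop_duplicates(cols, keep=keep)
    matches[keep] = via_mask.equals(via_drop)

print(matches)
</pre></div>
</div>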
</section>
<section id="id9">
<h3>4. Replacement Functions<a class="headerlink" href="#id9" title="Permalink to this heading">#</a></h3>
<p>Replacement is generally applied to a single column, so the examples below all use a <code class="docutils literal notranslate"><span class="pre">Series</span></code>. Replacement functions in <code class="docutils literal notranslate"><span class="pre">pandas</span></code> fall into three categories: mapping replacement, logical replacement, and numerical replacement. Mapping replacement includes the <code class="docutils literal notranslate"><span class="pre">replace</span></code> method, the <code class="docutils literal notranslate"><span class="pre">str.replace</span></code> method from Chapter 8, and the <code class="docutils literal notranslate"><span class="pre">cat.codes</span></code> attribute from Chapter 9; this section covers <code class="docutils literal notranslate"><span class="pre">replace</span></code>.</p>
<p>With <code class="docutils literal notranslate"><span class="pre">replace</span></code>, the substitution can be specified either with a dictionary or by passing two lists:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [67]: </span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;Gender&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">({</span><span class="s1">&#39;Female&#39;</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;Male&#39;</span><span class="p">:</span><span class="mi">1</span><span class="p">})</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[67]: </span>
<span class="go">0    0</span>
<span class="go">1    1</span>
<span class="go">2    1</span>
<span class="go">3    0</span>
<span class="go">4    1</span>
<span class="go">Name: Gender, dtype: int64</span>

<span class="gp">In [68]: </span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;Gender&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">([</span><span class="s1">&#39;Female&#39;</span><span class="p">,</span> <span class="s1">&#39;Male&#39;</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[68]: </span>
<span class="go">0    0</span>
<span class="go">1    1</span>
<span class="go">2    1</span>
<span class="go">3    0</span>
<span class="go">4    1</span>
<span class="go">Name: Gender, dtype: int64</span>
</pre></div>
</div>
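<p>One detail the outputs above do not show is what happens to values that have no replacement rule. The sketch below adds a hypothetical <code class="docutils literal notranslate"><span class="pre">'Unknown'</span></code> entry to a small Series (explicitly stored with object dtype, which is an assumption of this example) and compares the two calling styles:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pandas as pd

# Hypothetical Series used only for this illustration.
s_demo = pd.Series(['Female', 'Male', 'Unknown', 'Male'], dtype=object)

by_dict = s_demo.replace({'Female': 0, 'Male': 1})       # mapping form
by_lists = s_demo.replace(['Female', 'Male'], [0, 1])    # two-list form

# Values without a rule ('Unknown') pass through unchanged.
print(by_dict.tolist())
print(by_lists.tolist())
</pre></div>
</div>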
<p><code class="docutils literal notranslate"><span class="pre">replace</span></code> also supports a special directional replacement: setting the <code class="docutils literal notranslate"><span class="pre">method</span></code> parameter to <code class="docutils literal notranslate"><span class="pre">ffill</span></code> replaces each matched value with the nearest preceding value that was not replaced, while <code class="docutils literal notranslate"><span class="pre">bfill</span></code> uses the nearest following value that was not replaced. The example below shows that the two give different results:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [69]: </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;a&#39;</span><span class="p">])</span>

<span class="gp">In [70]: </span><span class="n">s</span><span class="o">.</span><span class="n">replace</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="n">method</span><span class="o">=</span><span class="s1">&#39;ffill&#39;</span><span class="p">)</span>
<span class="gh">Out[70]: </span>
<span class="go">0    a</span>
<span class="go">1    a</span>
<span class="go">2    b</span>
<span class="go">3    b</span>
<span class="go">4    b</span>
<span class="go">5    b</span>
<span class="go">6    a</span>
<span class="go">dtype: object</span>

<span class="gp">In [71]: </span><span class="n">s</span><span class="o">.</span><span class="n">replace</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="n">method</span><span class="o">=</span><span class="s1">&#39;bfill&#39;</span><span class="p">)</span>
<span class="gh">Out[71]: </span>
<span class="go">0    a</span>
<span class="go">1    b</span>
<span class="go">2    b</span>
<span class="go">3    a</span>
<span class="go">4    a</span>
<span class="go">5    a</span>
<span class="go">6    a</span>
<span class="go">dtype: object</span>
</pre></div>
</div>
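<p>To the best of our knowledge, the <code class="docutils literal notranslate"><span class="pre">method</span></code> keyword of <code class="docutils literal notranslate"><span class="pre">replace</span></code> has been deprecated in more recent pandas releases, so the calls above may warn or fail depending on the installed version. A possible alternative, offered here as a sketch rather than as the approach this tutorial relies on, is to blank out the matched values with <code class="docutils literal notranslate"><span class="pre">mask</span></code> and then fill them directionally:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import pandas as pd

s_demo = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])

# Mark the values that should be replaced.
to_replace = s_demo.isin([1, 2])

# Turn the marked values into missing values, then fill them from
# the nearest remaining neighbour in the chosen direction.
forward = s_demo.mask(to_replace).ffill()
backward = s_demo.mask(to_replace).bfill()

print(forward.tolist())
print(backward.tolist())
</pre></div>
</div>
<p>This variant treats any value that was already missing the same way as a matched value, which may or may not be desirable.</p>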
<div class="caution admonition">
<p class="admonition-title">Use <code class="docutils literal notranslate"><span class="pre">str.replace</span></code> for regex replacement</p>
<blockquote>
<div><p>Although <code class="docutils literal notranslate"><span class="pre">replace</span></code> supports regex replacement, in the current version regex replacement on the <code class="docutils literal notranslate"><span class="pre">string</span></code> type still has a <a class="reference external" href="https://github.com/pandas-dev/pandas/pull/36038">bug</a>. If you need this, use <code class="docutils literal notranslate"><span class="pre">str.replace</span></code> instead; the details are covered in Chapter 8.</p>
</div></blockquote>
</div>
<p>Logical replacement consists of <code class="docutils literal notranslate"><span class="pre">where</span></code> and <code class="docutils literal notranslate"><span class="pre">mask</span></code>, which are exact mirror images of each other: <code class="docutils literal notranslate"><span class="pre">where</span></code> replaces the rows where the condition is <code class="docutils literal notranslate"><span class="pre">False</span></code>, whereas <code class="docutils literal notranslate"><span class="pre">mask</span></code> replaces the rows where the condition is <code class="docutils literal notranslate"><span class="pre">True</span></code>. When no replacement value is given, missing values are used.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [72]: </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mf">1.2345</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="o">-</span><span class="mi">50</span><span class="p">])</span>

<span class="gp">In [73]: </span><span class="n">s</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">s</span><span class="o">&lt;</span><span class="mi">0</span><span class="p">)</span>
<span class="gh">Out[73]: </span>
<span class="go">0    -1.0</span>
<span class="go">1     NaN</span>
<span class="go">2     NaN</span>
<span class="go">3   -50.0</span>
<span class="go">dtype: float64</span>

<span class="gp">In [74]: </span><span class="n">s</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">s</span><span class="o">&lt;</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
<span class="gh">Out[74]: </span>
<span class="go">0     -1.0</span>
<span class="go">1    100.0</span>
<span class="go">2    100.0</span>
<span class="go">3    -50.0</span>
<span class="go">dtype: float64</span>

<span class="gp">In [75]: </span><span class="n">s</span><span class="o">.</span><span class="n">mask</span><span class="p">(</span><span class="n">s</span><span class="o">&lt;</span><span class="mi">0</span><span class="p">)</span>
<span class="gh">Out[75]: </span>
<span class="go">0         NaN</span>
<span class="go">1      1.2345</span>
<span class="go">2    100.0000</span>
<span class="go">3         NaN</span>
<span class="go">dtype: float64</span>

<span class="gp">In [76]: </span><span class="n">s</span><span class="o">.</span><span class="n">mask</span><span class="p">(</span><span class="n">s</span><span class="o">&lt;</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">50</span><span class="p">)</span>
<span class="gh">Out[76]: </span>
<span class="go">0    -50.0000</span>
<span class="go">1      1.2345</span>
<span class="go">2    100.0000</span>
<span class="go">3    -50.0000</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
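<p>The symmetry described above can be checked directly. The following is a minimal sketch, reusing the same <code class="docutils literal notranslate"><span class="pre">s</span></code>; the variable names are chosen here for illustration only.</p>

```python
import pandas as pd

s = pd.Series([-1, 1.2345, 100, -50])
cond = s < 0

# where replaces entries whose condition is False;
# mask replaces entries whose condition is True.
# Negating the condition therefore turns one into the other.
via_where = s.where(cond, 100)
via_mask = s.mask(~cond, 100)

print(via_where.equals(via_mask))
```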
<p>Note that the condition only needs to be a Boolean sequence whose index matches that of the calling <code class="docutils literal notranslate"><span class="pre">Series</span></code>:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [77]: </span><span class="n">s_condition</span><span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="kc">True</span><span class="p">,</span><span class="kc">False</span><span class="p">,</span><span class="kc">False</span><span class="p">,</span><span class="kc">True</span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="n">s</span><span class="o">.</span><span class="n">index</span><span class="p">)</span>

<span class="gp">In [78]: </span><span class="n">s</span><span class="o">.</span><span class="n">mask</span><span class="p">(</span><span class="n">s_condition</span><span class="p">,</span> <span class="o">-</span><span class="mi">50</span><span class="p">)</span>
<span class="gh">Out[78]: </span>
<span class="go">0    -50.0000</span>
<span class="go">1      1.2345</span>
<span class="go">2    100.0000</span>
<span class="go">3    -50.0000</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
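<p>Because the condition is matched by index label rather than by position, a Boolean <code class="docutils literal notranslate"><span class="pre">Series</span></code> whose labels appear in a different order should give the same replacement. A small sketch, where <code class="docutils literal notranslate"><span class="pre">cond_shuffled</span></code> is a made-up name:</p>

```python
import pandas as pd

s = pd.Series([-1, 1.2345, 100, -50])

# Same labels as s, listed in a different order: labels 3 and 0 are True
cond_shuffled = pd.Series([True, True, False, False], index=[3, 0, 1, 2])

# The condition is aligned to s by label before the replacement is applied
result = s.mask(cond_shuffled, -50)
print(result)
```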
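<p>Numeric replacement covers the <code class="docutils literal notranslate"><span class="pre">round,</span> <span class="pre">abs,</span> <span class="pre">clip</span></code> methods, which round to a given precision, take the absolute value, and clip values to given bounds, respectively:</p>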
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [79]: </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mf">1.2345</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="o">-</span><span class="mi">50</span><span class="p">])</span>

<span class="gp">In [80]: </span><span class="n">s</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="gh">Out[80]: </span>
<span class="go">0     -1.00</span>
<span class="go">1      1.23</span>
<span class="go">2    100.00</span>
<span class="go">3    -50.00</span>
<span class="go">dtype: float64</span>

<span class="gp">In [81]: </span><span class="n">s</span><span class="o">.</span><span class="n">abs</span><span class="p">()</span>
<span class="gh">Out[81]: </span>
<span class="go">0      1.0000</span>
<span class="go">1      1.2345</span>
<span class="go">2    100.0000</span>
<span class="go">3     50.0000</span>
<span class="go">dtype: float64</span>

<span class="gp">In [82]: </span><span class="n">s</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="c1"># the two arguments are the lower and upper clipping bounds, in that order</span>
<span class="gh">Out[82]: </span>
<span class="go">0    0.0000</span>
<span class="go">1    1.2345</span>
<span class="go">2    2.0000</span>
<span class="go">3    0.0000</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
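<p>One way to read <code class="docutils literal notranslate"><span class="pre">clip</span></code> is as two logical replacements in a row, each using a bound as the replacement value. The sketch below is an illustration under that reading, not a description of how pandas implements it; <code class="docutils literal notranslate"><span class="pre">lower</span></code> and <code class="docutils literal notranslate"><span class="pre">upper</span></code> are names introduced here.</p>

```python
import pandas as pd

s = pd.Series([-1, 1.2345, 100, -50])
lower, upper = 0, 2

clipped = s.clip(lower, upper)

# Values below the lower bound become the lower bound,
# values above the upper bound become the upper bound.
manual = s.mask(s < lower, lower).mask(s > upper, upper)

print(clipped.equals(manual))
```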
<div class="hint admonition">
<p class="admonition-title">Try it</p>
<blockquote>
<div><p>With <code class="docutils literal notranslate"><span class="pre">clip</span></code>, out-of-bounds values can only be set to the boundary values. How would you replace them with custom values instead?</p>
</div></blockquote>
</div>
</section>
<section id="id10">
<h3>5. Sorting functions<a class="headerlink" href="#id10" title="Permalink to this heading">#</a></h3>
<p>There are two ways to sort: by values and by index. The corresponding functions are <code class="docutils literal notranslate"><span class="pre">sort_values</span></code> and <code class="docutils literal notranslate"><span class="pre">sort_index</span></code>.</p>
<p>To demonstrate the sorting functions, we first use the <code class="docutils literal notranslate"><span class="pre">set_index</span></code> method to turn the grade and name columns into the index. Multi-level indexes and index setting are covered in detail in Chapter 3.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [83]: </span><span class="n">df_demo</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s1">&#39;Grade&#39;</span><span class="p">,</span> <span class="s1">&#39;Name&#39;</span><span class="p">,</span> <span class="s1">&#39;Height&#39;</span><span class="p">,</span>
<span class="gp">   ....: </span>              <span class="s1">&#39;Weight&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">set_index</span><span class="p">([</span><span class="s1">&#39;Grade&#39;</span><span class="p">,</span><span class="s1">&#39;Name&#39;</span><span class="p">])</span>
<span class="gp">   ....: </span>
</pre></div>
</div>
<p>Sort by height; the default <code class="docutils literal notranslate"><span class="pre">ascending=True</span></code> gives ascending order:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [84]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">&#39;Height&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[84]: </span>
<span class="go">                         Height  Weight</span>
<span class="go">Grade     Name                         </span>
<span class="go">Junior    Xiaoli Chu      145.4    34.0</span>
<span class="go">Senior    Gaomei Lv       147.3    34.0</span>
<span class="go">Sophomore Peng Han        147.8    34.0</span>
<span class="go">Senior    Changli Lv      148.7    41.0</span>
<span class="go">Sophomore Changjuan You   150.5    40.0</span>

<span class="gp">In [85]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">&#39;Height&#39;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[85]: </span>
<span class="go">                        Height  Weight</span>
<span class="go">Grade    Name                         </span>
<span class="go">Senior   Xiaoqiang Qin   193.9    79.0</span>
<span class="go">         Mei Sun         188.9    89.0</span>
<span class="go">         Gaoli Zhao      186.5    83.0</span>
<span class="go">Freshman Qiang Han       185.3    87.0</span>
<span class="go">Senior   Qiang Zheng     183.9    87.0</span>
</pre></div>
</div>
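<p>Sorting by several columns is a common need. For example, sort by weight in ascending order and, among rows with equal weight, by height in descending order:</p>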
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [86]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s1">&#39;Weight&#39;</span><span class="p">,</span><span class="s1">&#39;Height&#39;</span><span class="p">],</span><span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="kc">True</span><span class="p">,</span><span class="kc">False</span><span class="p">])</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[86]: </span>
<span class="go">                       Height  Weight</span>
<span class="go">Grade     Name                       </span>
<span class="go">Sophomore Peng Han      147.8    34.0</span>
<span class="go">Senior    Gaomei Lv     147.3    34.0</span>
<span class="go">Junior    Xiaoli Chu    145.4    34.0</span>
<span class="go">Sophomore Qiang Zhou    150.5    36.0</span>
<span class="go">Freshman  Yanqiang Xu   152.4    38.0</span>
</pre></div>
</div>
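<p>The tie-breaking behaviour is easier to see on a tiny table. The following sketch uses a hypothetical four-row frame with made-up values; only <code class="docutils literal notranslate"><span class="pre">sort_values</span></code> and its <code class="docutils literal notranslate"><span class="pre">ascending</span></code> list come from the text above.</p>

```python
import pandas as pd

# Hypothetical miniature table; the values are invented for illustration
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Height': [150.0, 160.0, 155.0, 170.0],
    'Weight': [40.0, 40.0, 38.0, 40.0],
})

# Weight ascending first; rows tied on Weight are ordered by Height descending
ordered = toy.sort_values(['Weight', 'Height'], ascending=[True, False])
print(ordered)
```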
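<p>Index sorting works the same way as value sorting, except that the values being compared live in the index, so the index levels to sort by must be given by name or by number through the <code class="docutils literal notranslate"><span class="pre">level</span></code> parameter. Note also that strings are ordered alphabetically.</p>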
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [87]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;Grade&#39;</span><span class="p">,</span><span class="s1">&#39;Name&#39;</span><span class="p">],</span><span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="kc">True</span><span class="p">,</span><span class="kc">False</span><span class="p">])</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[87]: </span>
<span class="go">                        Height  Weight</span>
<span class="go">Grade    Name                         </span>
<span class="go">Freshman Yanquan Wang    163.5    55.0</span>
<span class="go">         Yanqiang Xu     152.4    38.0</span>
<span class="go">         Yanqiang Feng   162.3    51.0</span>
<span class="go">         Yanpeng Lv        NaN    65.0</span>
<span class="go">         Yanli Zhang     165.1    52.0</span>
</pre></div>
</div>
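<p>As a small check that level names and level numbers are interchangeable, the sketch below sorts a hypothetical two-level index both ways. The grades and names are invented for illustration.</p>

```python
import pandas as pd

# Hypothetical frame with a (Grade, Name) MultiIndex
toy = pd.DataFrame(
    {'Height': [150.0, 160.0, 155.0, 170.0]},
    index=pd.MultiIndex.from_tuples(
        [('Senior', 'Bo'), ('Freshman', 'Al'), ('Freshman', 'Cy'), ('Senior', 'Di')],
        names=['Grade', 'Name'],
    ),
)

# Grade ascending, Name descending, with levels given by name ...
by_name = toy.sort_index(level=['Grade', 'Name'], ascending=[True, False])
# ... and with the same levels given by number
by_number = toy.sort_index(level=[0, 1], ascending=[True, False])

print(by_name)
```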
</section>
<section id="apply">
<h3>6. The apply method<a class="headerlink" href="#apply" title="Permalink to this heading">#</a></h3>
<p>The <code class="docutils literal notranslate"><span class="pre">apply</span></code> method is commonly used to iterate over the rows or columns of a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>. Its <code class="docutils literal notranslate"><span class="pre">axis</span></code> parameter has the same meaning as in the statistical aggregation functions of Subsection 2, and the argument of <code class="docutils literal notranslate"><span class="pre">apply</span></code> is usually a function that takes a sequence as input. For example, <code class="docutils literal notranslate"><span class="pre">.mean()</span></code> can be written with <code class="docutils literal notranslate"><span class="pre">apply</span></code> as follows:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [88]: </span><span class="n">df_demo</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s1">&#39;Height&#39;</span><span class="p">,</span> <span class="s1">&#39;Weight&#39;</span><span class="p">]]</span>

<span class="gp">In [89]: </span><span class="k">def</span> <span class="nf">my_mean</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="gp">   ....: </span>    <span class="n">res</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="gp">   ....: </span>    <span class="k">return</span> <span class="n">res</span>
<span class="gp">   ....: </span>

<span class="gp">In [90]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">my_mean</span><span class="p">)</span>
<span class="gh">Out[90]: </span>
<span class="go">Height    163.218033</span>
<span class="go">Weight     55.015873</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
<p>Likewise, a <code class="docutils literal notranslate"><span class="pre">lambda</span></code> expression makes this more concise; here <code class="docutils literal notranslate"><span class="pre">x</span></code> stands for each sequence of the calling <code class="docutils literal notranslate"><span class="pre">df_demo</span></code> table, passed in one at a time:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [91]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span><span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="gh">Out[91]: </span>
<span class="go">Height    163.218033</span>
<span class="go">Weight     55.015873</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
<p>With <code class="docutils literal notranslate"><span class="pre">axis=1</span></code>, each call receives a <code class="docutils literal notranslate"><span class="pre">Series</span></code> made up of the elements of one row, and the result matches the row-wise mean computed earlier.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [92]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span><span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">(),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[92]: </span>
<span class="go">0    102.45</span>
<span class="go">1    118.25</span>
<span class="go">2    138.95</span>
<span class="go">3     41.00</span>
<span class="go">4    124.00</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
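<p>The agreement with the built-in row-wise mean can be verified on a small frame. This sketch uses a hypothetical three-row table, including one missing height to show that the missing value is skipped in both cases.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical table; the last row has a missing Height
toy = pd.DataFrame({
    'Height': [158.9, 166.5, np.nan],
    'Weight': [46.0, 70.0, 41.0],
})

# Each call of the lambda receives one row as a Series
row_means_apply = toy.apply(lambda row: row.mean(), axis=1)
row_means_builtin = toy.mean(axis=1)

print(row_means_apply)
```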
<p>Here is another example: the <code class="docutils literal notranslate"><span class="pre">mad</span></code> function returns the mean of the absolute deviations of a sequence from its own mean. For the sequence 1, 3, 7, 10, the mean is 5.25, the absolute deviations are 4.25, 2.25, 1.75, 4.75, and the mean of these deviations is 3.25. We now use <code class="docutils literal notranslate"><span class="pre">apply</span></code> to compute the <code class="docutils literal notranslate"><span class="pre">mad</span></code> of height and weight:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [93]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:(</span><span class="n">x</span><span class="o">-</span><span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span><span class="o">.</span><span class="n">abs</span><span class="p">()</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="gh">Out[93]: </span>
<span class="go">Height     6.707229</span>
<span class="go">Weight    10.391870</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
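<p>This agrees with the result of the built-in <code class="docutils literal notranslate"><span class="pre">mad</span></code> function (deprecated in pandas 1.5 and removed in pandas 2.0, so the call below requires an older version):</p>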
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [94]: </span><span class="n">df_demo</span><span class="o">.</span><span class="n">mad</span><span class="p">()</span>
<span class="gh">Out[94]: </span>
<span class="go">Height     6.707229</span>
<span class="go">Weight    10.391870</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
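<p>The arithmetic of the 1, 3, 7, 10 example can be reproduced step by step. The sketch below computes the mean absolute deviation by hand and does not rely on a built-in <code class="docutils literal notranslate"><span class="pre">mad</span></code> method; the variable names are illustrative.</p>

```python
import pandas as pd

x = pd.Series([1, 3, 7, 10])

# Absolute deviation of every element from the mean (5.25)
deviations = (x - x.mean()).abs()
# Mean of those deviations
mad_value = deviations.mean()

print(deviations.tolist(), mad_value)
```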
<div class="caution admonition">
<p class="admonition-title">Use <code class="docutils literal notranslate"><span class="pre">apply</span></code> with caution</p>
<blockquote>
<div><p>Because it accepts custom functions, <code class="docutils literal notranslate"><span class="pre">apply</span></code> is highly flexible, but that flexibility comes at the cost of performance. In general, a built-in <code class="docutils literal notranslate"><span class="pre">pandas</span></code> function is considerably faster than <code class="docutils literal notranslate"><span class="pre">apply</span></code> on the same task, so reach for <code class="docutils literal notranslate"><span class="pre">apply</span></code> only when there is a genuine need for custom logic.</p>
</div></blockquote>
</div>
</section>
</section>
<section id="id11">
<h2>IV. Window objects<a class="headerlink" href="#id11" title="Permalink to this heading">#</a></h2>
<p><code class="docutils literal notranslate"><span class="pre">pandas</span></code> has three kinds of windows: the sliding window <code class="docutils literal notranslate"><span class="pre">rolling</span></code>, the expanding window <code class="docutils literal notranslate"><span class="pre">expanding</span></code>, and the exponentially weighted window <code class="docutils literal notranslate"><span class="pre">ewm</span></code>. Sliding windows whose size is a date offset are discussed in Chapter 10, and the exponentially weighted window is covered in the exercises of this chapter.</p>
<section id="id12">
<h3>1. Rolling objects<a class="headerlink" href="#id12" title="Permalink to this heading">#</a></h3>
<p>To use rolling functions, first call <code class="docutils literal notranslate"><span class="pre">.rolling</span></code> on a sequence to obtain a rolling object. Its most important parameter is the window size, <code class="docutils literal notranslate"><span class="pre">window</span></code>.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [95]: </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">])</span>

<span class="gp">In [96]: </span><span class="n">roller</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">rolling</span><span class="p">(</span><span class="n">window</span> <span class="o">=</span> <span class="mi">3</span><span class="p">)</span>

<span class="gp">In [97]: </span><span class="n">roller</span>
<span class="gh">Out[97]: </span><span class="go">Rolling [window=3,center=False,axis=0]</span>
</pre></div>
</div>
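<p>Once you have a rolling object, you can apply aggregation functions to it. Note that the window includes the element of the current row: the mean at the fourth position is (2+3+4)/3, not (1+2+3)/3:</p>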
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [98]: </span><span class="n">roller</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="gh">Out[98]: </span>
<span class="go">0    NaN</span>
<span class="go">1    NaN</span>
<span class="go">2    2.0</span>
<span class="go">3    3.0</span>
<span class="go">4    4.0</span>
<span class="go">dtype: float64</span>

<span class="gp">In [99]: </span><span class="n">roller</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="gh">Out[99]: </span>
<span class="go">0     NaN</span>
<span class="go">1     NaN</span>
<span class="go">2     6.0</span>
<span class="go">3     9.0</span>
<span class="go">4    12.0</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
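<p>To make the window membership explicit, the rolling mean can be rebuilt with a plain loop. This is only an illustrative sketch of the windowing rule described above, not how pandas computes it internally.</p>

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
window = 3

manual = []
for pos in range(len(s)):
    if pos + 1 < window:
        # Fewer than `window` elements are available so far
        manual.append(float('nan'))
    else:
        # The window ends at the current position and includes it
        manual.append(s.iloc[pos - window + 1 : pos + 1].mean())
manual = pd.Series(manual)

print(manual)
```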
<p>Rolling correlation and rolling covariance can be computed as follows:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [100]: </span><span class="n">s2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">16</span><span class="p">,</span><span class="mi">30</span><span class="p">])</span>

<span class="gp">In [101]: </span><span class="n">roller</span><span class="o">.</span><span class="n">cov</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span>
<span class="gh">Out[101]: </span>
<span class="go">0     NaN</span>
<span class="go">1     NaN</span>
<span class="go">2     2.5</span>
<span class="go">3     7.0</span>
<span class="go">4    12.0</span>
<span class="go">dtype: float64</span>

<span class="gp">In [102]: </span><span class="n">roller</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span>
<span class="gh">Out[102]: </span>
<span class="go">0         NaN</span>
<span class="go">1         NaN</span>
<span class="go">2    0.944911</span>
<span class="go">3    0.970725</span>
<span class="go">4    0.995402</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
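<p>The first non-missing values above can be checked by hand. The sketch below recomputes the sample covariance and correlation for the window covering the first three elements of both series, assuming the usual n-1 denominator.</p>

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([1, 2, 6, 16, 30])

# Window ending at the third position: elements 0..2 of both series
a, b = s.iloc[0:3], s2.iloc[0:3]

# Sample covariance with an n-1 denominator
cov_manual = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)
# Correlation is the covariance scaled by both sample standard deviations
corr_manual = cov_manual / (a.std() * b.std())

print(cov_manual, corr_manual)
```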
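<p>Custom functions can also be passed in through <code class="docutils literal notranslate"><span class="pre">apply</span></code>; each call receives the <code class="docutils literal notranslate"><span class="pre">Series</span></code> of the corresponding window. For example, the mean above can be expressed equivalently as:</p>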
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [103]: </span><span class="n">roller</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span><span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="gh">Out[103]: </span>
<span class="go">0    NaN</span>
<span class="go">1    NaN</span>
<span class="go">2    2.0</span>
<span class="go">3    3.0</span>
<span class="go">4    4.0</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
<p><code class="docutils literal notranslate"><span class="pre">shift,</span> <span class="pre">diff,</span> <span class="pre">pct_change</span></code> are a group of rolling-like functions. They share the parameter <code class="docutils literal notranslate"><span class="pre">periods=n</span></code>, which defaults to 1. Respectively, they take the value of the element <code class="docutils literal notranslate"><span class="pre">n</span></code> positions earlier, subtract the element <code class="docutils literal notranslate"><span class="pre">n</span></code> positions earlier (unlike in <code class="docutils literal notranslate"><span class="pre">NumPy</span></code>, where <code class="docutils literal notranslate"><span class="pre">n</span></code> is the order of the difference), and compute the growth rate relative to the element <code class="docutils literal notranslate"><span class="pre">n</span></code> positions earlier. Here <code class="docutils literal notranslate"><span class="pre">n</span></code> may be negative, which performs the analogous operation in the opposite direction.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [104]: </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">15</span><span class="p">])</span>

<span class="gp">In [105]: </span><span class="n">s</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="gh">Out[105]: </span>
<span class="go">0    NaN</span>
<span class="go">1    NaN</span>
<span class="go">2    1.0</span>
<span class="go">3    3.0</span>
<span class="go">4    6.0</span>
<span class="go">dtype: float64</span>

<span class="gp">In [106]: </span><span class="n">s</span><span class="o">.</span><span class="n">diff</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
<span class="gh">Out[106]: </span>
<span class="go">0     NaN</span>
<span class="go">1     NaN</span>
<span class="go">2     NaN</span>
<span class="go">3     9.0</span>
<span class="go">4    12.0</span>
<span class="go">dtype: float64</span>

<span class="gp">In [107]: </span><span class="n">s</span><span class="o">.</span><span class="n">pct_change</span><span class="p">()</span>
<span class="gh">Out[107]: </span>
<span class="go">0         NaN</span>
<span class="go">1    2.000000</span>
<span class="go">2    1.000000</span>
<span class="go">3    0.666667</span>
<span class="go">4    0.500000</span>
<span class="go">dtype: float64</span>

<span class="gp">In [108]: </span><span class="n">s</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="gh">Out[108]: </span>
<span class="go">0     3.0</span>
<span class="go">1     6.0</span>
<span class="go">2    10.0</span>
<span class="go">3    15.0</span>
<span class="go">4     NaN</span>
<span class="go">dtype: float64</span>

<span class="gp">In [109]: </span><span class="n">s</span><span class="o">.</span><span class="n">diff</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span>
<span class="gh">Out[109]: </span>
<span class="go">0   -5.0</span>
<span class="go">1   -7.0</span>
<span class="go">2   -9.0</span>
<span class="go">3    NaN</span>
<span class="go">4    NaN</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
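<p>The three functions are closely related: both the lagged difference and the growth rate can be written in terms of <code class="docutils literal notranslate"><span class="pre">shift</span></code>. The sketch below checks this for a lag of 2 and contrasts it with <code class="docutils literal notranslate"><span class="pre">np.diff</span></code>, whose second argument is the order of the difference.</p>

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 6, 10, 15])
n = 2

# Lagged difference and growth rate expressed through shift
diff_via_shift = s - s.shift(n)
pct_via_shift = s / s.shift(n) - 1

# In NumPy the second argument means "difference of order 2",
# i.e. the difference of the first differences
second_order = np.diff(s.to_numpy(), 2)

print(diff_via_shift.tolist(), second_order.tolist())
```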
<p>They are regarded as rolling-like functions because each of them can be reproduced by a <code class="docutils literal notranslate"><span class="pre">rolling</span></code> call with window size <code class="docutils literal notranslate"><span class="pre">n+1</span></code>:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [110]: </span><span class="n">s</span><span class="o">.</span><span class="n">rolling</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span><span class="nb">list</span><span class="p">(</span><span class="n">x</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># s.shift(2)</span>
<span class="gh">Out[110]: </span>
<span class="go">0    NaN</span>
<span class="go">1    NaN</span>
<span class="go">2    1.0</span>
<span class="go">3    3.0</span>
<span class="go">4    6.0</span>
<span class="go">dtype: float64</span>

<span class="gp">In [111]: </span><span class="n">s</span><span class="o">.</span><span class="n">rolling</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span><span class="nb">list</span><span class="p">(</span><span class="n">x</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">-</span><span class="nb">list</span><span class="p">(</span><span class="n">x</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># s.diff(3)</span>
<span class="gh">Out[111]: </span>
<span class="go">0     NaN</span>
<span class="go">1     NaN</span>
<span class="go">2     NaN</span>
<span class="go">3     9.0</span>
<span class="go">4    12.0</span>
<span class="go">dtype: float64</span>

<span class="gp">In [112]: </span><span class="k">def</span> <span class="nf">my_pct</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="gp">   .....: </span>    <span class="n">L</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="gp">   .....: </span>    <span class="k">return</span> <span class="n">L</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">/</span><span class="n">L</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">-</span><span class="mi">1</span>
<span class="gp">   .....: </span>

<span class="gp">In [113]: </span><span class="n">s</span><span class="o">.</span><span class="n">rolling</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">my_pct</span><span class="p">)</span> <span class="c1"># s.pct_change()</span>
<span class="gh">Out[113]: </span>
<span class="go">0         NaN</span>
<span class="go">1    2.000000</span>
<span class="go">2    1.000000</span>
<span class="go">3    0.666667</span>
<span class="go">4    0.500000</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
<div class="hint admonition">
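<p class="admonition-title">Try it</p>
<blockquote>
<div><p>By default, the window of a <code class="docutils literal notranslate"><span class="pre">rolling</span></code> object extends toward earlier elements. In some cases a window that extends toward later elements is needed instead; for example, a <code class="docutils literal notranslate"><span class="pre">sum</span></code> over such a window of size 2 on 1, 2, 3 gives 3, 5, NaN. How can this kind of rolling operation be implemented?</p>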
</div></blockquote>
</div>
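<p>One possible approach, offered only as a sketch and not as the intended solution, is to reverse the series, apply the ordinary rolling sum, and reverse the result back:</p>

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Reverse, roll over the (now earlier) elements, then restore the original order
backward_sum = s.iloc[::-1].rolling(2).sum().iloc[::-1]

print(backward_sum)
```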
</section>
<section id="id13">
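<h3>2. Expanding windows<a class="headerlink" href="#id13" title="Permalink to this heading">#</a></h3>
<p>An expanding window, also called a cumulative window, can be understood as a window of dynamic length that runs from the start of the sequence to the position being processed; the aggregation function is applied to each of these progressively growing windows. Concretely, for a sequence a1, a2, a3, a4, the windows at the successive positions are [a1], [a1, a2], [a1, a2, a3], [a1, a2, a3, a4].</p>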
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [114]: </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>

<span class="gp">In [115]: </span><span class="n">s</span><span class="o">.</span><span class="n">expanding</span><span class="p">()</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="gh">Out[115]: </span>
<span class="go">0    1.000000</span>
<span class="go">1    2.000000</span>
<span class="go">2    3.333333</span>
<span class="go">3    5.000000</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
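<p>The growing windows can be listed explicitly. The following sketch builds them with a list comprehension and averages each one, which should reproduce the expanding mean shown above; the names <code class="docutils literal notranslate"><span class="pre">windows</span></code> and <code class="docutils literal notranslate"><span class="pre">manual_means</span></code> are introduced here for illustration.</p>

```python
import pandas as pd

s = pd.Series([1, 3, 6, 10])

# Window at each position: everything from the start up to and including it
windows = [s.iloc[: pos + 1].tolist() for pos in range(len(s))]
manual_means = [sum(w) / len(w) for w in windows]

print(windows)
print(manual_means)
```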
<div class="hint admonition">
<p class="admonition-title">Try it</p>
<blockquote>
<div><p><code class="docutils literal notranslate"><span class="pre">cummax,</span> <span class="pre">cumsum,</span> <span class="pre">cumprod</span></code> are typical expanding-like functions. Implement each of them with an <code class="docutils literal notranslate"><span class="pre">expanding</span></code> object.</p>
</div></blockquote>
</div>
</section>
</section>
<section id="id14">
<h2>V. Exercises<a class="headerlink" href="#id14" title="Permalink to this heading">#</a></h2>
<section id="ex1">
<h3>Ex1: Pokémon dataset<a class="headerlink" href="#ex1" title="Permalink to this heading">#</a></h3>
<p>We are given a Pokémon dataset. Some background first:</p>
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">#</span></code> is the National Pokédex number; rows that share the same number are different forms of the same Pokémon</p></li>
<li><p>A Pokémon has either one type or two types; for single-type Pokémon, <code class="docutils literal notranslate"><span class="pre">Type</span> <span class="pre">2</span></code> is a missing value</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">Total,</span> <span class="pre">HP,</span> <span class="pre">Attack,</span> <span class="pre">Defense,</span> <span class="pre">Sp.</span> <span class="pre">Atk,</span> <span class="pre">Sp.</span> <span class="pre">Def,</span> <span class="pre">Speed</span></code> are the base stat total, hit points, attack, defense, special attack, special defense, and speed, where the base stat total is the sum of the other six</p></li>
</ul>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [116]: </span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/pokemon.csv&#39;</span><span class="p">)</span>

<span class="gp">In [117]: </span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
<span class="gh">Out[117]: </span>
<span class="go">   #       Name Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed</span>
<span class="go">0  1  Bulbasaur  Grass  Poison    318  45      49       49       65       65     45</span>
<span class="go">1  2    Ivysaur  Grass  Poison    405  60      62       63       80       80     60</span>
<span class="go">2  3   Venusaur  Grass  Poison    525  80      82       83      100      100     80</span>
</pre></div>
</div>
<ol class="arabic simple">
<li><p>Sum <code class="docutils literal notranslate"><span class="pre">HP,</span> <span class="pre">Attack,</span> <span class="pre">Defense,</span> <span class="pre">Sp.</span> <span class="pre">Atk,</span> <span class="pre">Sp.</span> <span class="pre">Def,</span> <span class="pre">Speed</span></code> and verify that the result equals the <code class="docutils literal notranslate"><span class="pre">Total</span></code> value.</p></li>
<li><p>For Pokémon with a repeated <code class="docutils literal notranslate"><span class="pre">#</span></code>, keep only the first record, then solve the following:</p></li>
</ol>
<ol class="loweralpha simple">
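<li><p>Find the number of distinct first types, and the three first types with the most Pokémon</p></li>
<li><p>Find the combinations of first type and second type that occur</p></li>
<li><p>Find the type combinations that have not yet appeared</p></li>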
</ol>
<ol class="arabic simple" start="3">
<li><p>Construct a <code class="docutils literal notranslate"><span class="pre">Series</span></code> according to each of the following requirements:</p></li>
</ol>
<ol class="loweralpha simple">
<li><p>Take the physical attack column; replace values above 120 with <code class="docutils literal notranslate"><span class="pre">high</span></code>, values below 50 with <code class="docutils literal notranslate"><span class="pre">low</span></code>, and all other values with <code class="docutils literal notranslate"><span class="pre">mid</span></code></p></li>
<li><p>Take the first type column and convert all letters to uppercase, once with <code class="docutils literal notranslate"><span class="pre">replace</span></code> and once with <code class="docutils literal notranslate"><span class="pre">apply</span></code></p></li>
<li><p>Compute the deviation of each Pokémon's six stats, that is, the largest deviation from the median among all six stats; add it to <code class="docutils literal notranslate"><span class="pre">df</span></code> and sort from largest to smallest</p></li>
</ol>
</section>
<section id="ex2">
<h3>Ex2: Exponentially Weighted Windows<a class="headerlink" href="#ex2" title="Permalink to this heading">#</a></h3>
<ol class="arabic simple">
<li><p>The <code class="docutils literal notranslate"><span class="pre">ewm</span></code> window as an expanding window</p></li>
</ol>
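<p>With an expanding window, users can apply various functions to compute cumulative statistics over the history, but these built-in statistical functions usually give every element in the window the same weight. The elements of a window can instead be given different weights, and the exponentially weighted window is one such special expanding window.</p>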
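<p>Its most important parameter is <code class="docutils literal notranslate"><span class="pre">alpha</span></code>, which by default sets the window weights to <span class="math notranslate nohighlight">\(w_i = (1 - \alpha)^i, i\in \{0, 1, ..., t\}\)</span>, where <span class="math notranslate nohighlight">\(i=0\)</span> corresponds to the current element and <span class="math notranslate nohighlight">\(i=t\)</span> to the first element of the series, since <span class="math notranslate nohighlight">\(w_i\)</span> is the weight applied to <span class="math notranslate nohighlight">\(x_{t-i}\)</span>.</p>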
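<p>The weight formula shows that the farther an element is from the current value, the smaller its weight. Writing the original series as <code class="docutils literal notranslate"><span class="pre">x</span></code> and the updated current element as <span class="math notranslate nohighlight">\(y_t\)</span>, normalizing the weighted sum gives:</p>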
<div class="math notranslate nohighlight">
\[\begin{split}y_t &amp;=\frac{\sum_{i=0}^{t} w_i x_{t-i}}{\sum_{i=0}^{t} w_i} \\
&amp;=\frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ...
+ (1 - \alpha)^{t} x_{0}}{1 + (1 - \alpha) + (1 - \alpha)^2 + ...
+ (1 - \alpha)^{t}}\\\end{split}\]</div>
<p>For a <code class="docutils literal notranslate"><span class="pre">Series</span></code>, the exponentially smoothed series can be computed with an <code class="docutils literal notranslate"><span class="pre">ewm</span></code> object as follows:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [118]: </span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

<span class="gp">In [119]: </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">30</span><span class="p">)</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span>

<span class="gp">In [120]: </span><span class="n">s</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[120]: </span>
<span class="go">0   -1</span>
<span class="go">1   -1</span>
<span class="go">2   -2</span>
<span class="go">3   -2</span>
<span class="go">4   -2</span>
<span class="go">dtype: int32</span>

<span class="gp">In [121]: </span><span class="n">s</span><span class="o">.</span><span class="n">ewm</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[121]: </span>
<span class="go">0   -1.000000</span>
<span class="go">1   -1.000000</span>
<span class="go">2   -1.409836</span>
<span class="go">3   -1.609756</span>
<span class="go">4   -1.725845</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
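<p>As a quick sanity check of the weighted-average formula, the sketch below recomputes a single value, <span class="math notranslate nohighlight">\(y_2\)</span>, by hand with NumPy. It assumes the first three values of <code class="docutils literal notranslate"><span class="pre">s</span></code> are those printed above (-1, -1, -2); the names <code class="docutils literal notranslate"><span class="pre">x</span></code>, <code class="docutils literal notranslate"><span class="pre">w</span></code> and <code class="docutils literal notranslate"><span class="pre">y_t</span></code> are illustrative only, and it handles one element rather than the whole series.</p>

```python
import numpy as np

alpha = 0.2
# Assumed input: the first three values of s, copied from the s.head() output above.
x = np.array([-1.0, -1.0, -2.0])
t = len(x) - 1  # position of the current element (t = 2)

# w_i = (1 - alpha)^i for i = 0, ..., t; w_0 is the weight of the current element x_t.
w = (1 - alpha) ** np.arange(t + 1)

# y_t = sum_i w_i * x_{t-i} / sum_i w_i; x[::-1] lists x_t, x_{t-1}, ..., x_0.
y_t = (w * x[::-1]).sum() / w.sum()
print(round(y_t, 6))  # -1.409836
```

<p>The result agrees with the value at index 2 of the <code class="docutils literal notranslate"><span class="pre">ewm</span></code> output above, which suggests that the default behaviour follows the normalized formula.</p>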
<p>Implement this computation with an <code class="docutils literal notranslate"><span class="pre">expanding</span></code> window.</p>
<ol class="arabic simple" start="2">
<li><p>The <code class="docutils literal notranslate"><span class="pre">ewm</span></code> window as a sliding window</p></li>
</ol>
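<p>As part 1 shows, <code class="docutils literal notranslate"><span class="pre">ewm</span></code>, being a special case of an expanding window, can only start weighting from the first element of the series. Now suppose a window limit <code class="docutils literal notranslate"><span class="pre">n</span></code> is given, so that only the <code class="docutils literal notranslate"><span class="pre">n</span></code> most recent elements, including the current one, form the window for sliding weighted smoothing. Based on the sliding-window functions, give the new formulas for <span class="math notranslate nohighlight">\(w_i\)</span> and the update of <span class="math notranslate nohighlight">\(y_t\)</span>, and implement this with a <code class="docutils literal notranslate"><span class="pre">rolling</span></code> window.</p>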
</section>
</section>
</section>


              </article>
              

              
          </div>
          
      </div>
    </div>

  
  
  <!-- Scripts loaded after <body> so the DOM is not blocked -->
  <script src="../_static/scripts/pydata-sphinx-theme.js?digest=92025949c220c2e29695"></script>

<footer class="bd-footer"><div class="bd-footer__inner container">
  
  <div class="footer-item">
    <p class="copyright">
    &copy; Copyright 2020-2022, Datawhale, 耿远昊.<br>
</p>
  </div>
  
  <div class="footer-item">
    <p class="sphinx-version">
Created using <a href="http://sphinx-doc.org/">Sphinx</a> 5.0.2.<br>
</p>
  </div>
  
</div>
</footer>
  </body>
</html>